COBS: A Background

COBS is a field trial comparing biofuel systems

First, let’s confirm that we have the required tools for asking questions of the NCBI database. To do this we will use the NCBI API: Entrez Direct and install it via the UNIX command line.

pwd
cd ~
perl -MNet::FTP -e \
  '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
   $ftp->login; $ftp->binary;
   $ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh

Let’s take a look to confirm that it was installed in the home directory:

cd ~
ls

Now that we have the tools, what will we use them for? COBS is a unique dataset that includes many metagenomes from soil aggregates. Many functions in ecosystems are microbially mediated and catalyzed by enzymes. In this study, funded by the DOE, we are interested in carnon cycleing and therefore the enyzmes associated with that funciton. Our collaborator (Kirsten Hofmockel) has identified a few enzymes of interest for our system. The list of enzymes is contained in a text file: ec_numbers.txt Let’s navigate to that directory and take a look:

cd ~/Documents/COBS_CAZY
cat ec_numbers.txt

We now have a list of enzymes of interest, let’s use those NCBI tools to find bacterial sequences associated with those enzymes by creating a python script:

import sys
from Bio import Entrez, SeqIO

Entrez.email = 'jflater@iastate.edu'

# First, find entries that contain the E.C. number
ec_num = sys.argv[1].strip()
#print ec_num
#print 'E.C. '+ ec_num
esearch_handle = Entrez.esearch(db='nucleotide', term='EC '+ec_num)
# When term='E.C. we get zero results, however, if term=EC it works
entries = Entrez.read(esearch_handle)
esearch_handle.close()

# Second, fetch these entries
efetch_handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb', retmode='xml') 
records = Entrez.parse(efetch_handle)

# Now, we go through the records and look for a feature with name 'EC_number'
for record in records:
      for feature in record['GBSeq_feature-table']:
          for subfeature in feature['GBFeature_quals']:
              if (subfeature['GBQualifier_name'] == 'EC_number'   and
                subfeature['GBQualifier_value'] == ec_num):

                    # If we found it, we extract the seq's start and end
                    accession = record['GBSeq_primary-accession']
                    interval = feature['GBFeature_intervals'][0]
                    interval_start = interval['GBInterval_from']
                    interval_end = interval['GBInterval_to']
                    location = feature['GBFeature_location']
                    if location.startswith('complement'):
                        strand = 2
                    else:
                        strand = 1

                    # Now we fetch the nucleotide sequence
                    handle = Entrez.efetch(db="nucleotide", id=accession,
                                           rettype="fasta", strand=strand,
                                           seq_start = interval_start,
                                           seq_stop = interval_end)
                    seq = SeqIO.read(handle, "fasta")

                    print('>GenBank Accession:{}'.format(accession))
                    print(seq.seq)
efetch_handle.close()

Now let’s apply that script to our list of EC #’s by using the command line:

while read line;     
  do python scripts/nucl_from_ec.py $line > "$line".txt;    
  done < ec_numbers.txt
cd Documents/COBS_CAZY
head 3.2.1.37.txt 

This command line script will go through our list of EC numbers and retrieve the associated nucleotide sequences and output a file for each EC # with those sequences in fasta format. ____ Python script wasn’t working. So we are using downloaded protien sequences to blast against NCBI Let’s try an online server mkdir temp_cazy _____ To use hpcc: Lot’s of disc space, temporary unlimited (3 months)

ssh *****@hpcc.msu.edu password:

We are at the gateway, now connect to development node (dev-intell-16, it’s newer)

ssh dv-intell-16
is home directory, only 50 g and Jin can’t see it. cd /mnt/scratch/***** empty big directory, temp unlimeted good for downloading big files for a short time now lets download db to that ftp://ftp.ncbi.nlm.nih.gov refseq_protien.10.tar.gz.md5 and *.gz is what we want to put on HPCC

In HPC: move out of dev intel…everyone is using it. There are many more computers to choose from begind that…called nodes We want to work in a node, we need to submit a job by using qsub

[*****@dev-intel16 ~]$ cd /mnt/scratch/***** [*****@dev-intel16 *****]$ ls [*****@dev-intel16 *****]$ mkdir COBS_CAZY [*****@dev-intel16 *****]$ cd COBS_CAZY/ [*****@dev-intel16 COBS_CAZY]$ emacs download.refseq.qsub

[1]+ Stopped emacs download.refseq.qsub [*****@dev-intel16 COBS_CAZY]$ ls download.refseq.qsub download.refseq.qsub~ [*****@dev-intel16 COBS_CAZY]$ rm *~ [*****@dev-intel16 COBS_CAZY]$ ls download.refseq.qsub [*****@dev-intel16 COBS_CAZY]$ qsub download.refseq.qsub 37523906.mgr-04.i [*****@dev-intel16 COBS_CAZY]$ ls download.refseq.qsub [*****@dev-intel16 COBS_CAZY]$

.o is output, .e is error

S = Q in line S = R running S = C complete

Check if everything downloaded correctly by using md5(it tells us if downloaded correctly) Once all is downlaoded, unzip

Then ready to run blastp (files in fasta) clone the github repository # note change spellin on protein for emacs

module load git (to upload repository) git clone url once repository is uploaded then blastp

submit blastp: qsub blastp.qsub

2016_12_02:09:06 AM, Noticed that we did not qsub blastp for EC *.86, let’s make that script and submit the job.

ssh *****@hpcc.msu.edu
password:
ssh dev-intel16
cd /mnt/scratch/*****/COBS_CAZY
ls
---
title: "COBS_CAZY"
output: html_notebook
root: ../../../
---
![](images/COBSPIC.png)
------
#COBS: A Background

COBS is a field trial comparing biofuel systems
------
First, let's confirm that we have the required tools for asking questions of the NCBI database. To do this we will use the NCBI API: Entrez Direct and install it via the UNIX command line. 
```{bash, eval=FALSE}
pwd
cd ~
perl -MNet::FTP -e \
  '$ftp = new Net::FTP("ftp.ncbi.nlm.nih.gov", Passive => 1);
   $ftp->login; $ftp->binary;
   $ftp->get("/entrez/entrezdirect/edirect.zip");'
unzip -u -q edirect.zip
rm edirect.zip
export PATH=$PATH:$HOME/edirect
./edirect/setup.sh
```
Let's take a look to confirm that it was installed in the home directory:
```{bash}
cd ~
ls
```
Now that we have the tools, what will we use them for? COBS is a unique dataset that includes many metagenomes from soil aggregates. Many functions in ecosystems are microbially mediated and catalyzed by enzymes. In this study, funded by the DOE, we are interested in carnon cycleing and therefore the enyzmes associated with that funciton. Our collaborator (Kirsten Hofmockel) has identified a few enzymes of interest for our system. The list of enzymes is contained in a text file: ec_numbers.txt
Let's navigate to that directory and take a look:
```{bash}
cd ~/Documents/COBS_CAZY
cat ec_numbers.txt
```
We now have a list of enzymes of interest, let's use those NCBI tools to find bacterial sequences associated with those enzymes by creating a python script:  
```{python, eval=FALSE}
import sys
from Bio import Entrez, SeqIO

Entrez.email = 'jflater@iastate.edu'

# First, find entries that contain the E.C. number
ec_num = sys.argv[1].strip()
#print ec_num
#print 'E.C. '+ ec_num
esearch_handle = Entrez.esearch(db='nucleotide', term='EC '+ec_num)
# When term='E.C. we get zero results, however, if term=EC it works
entries = Entrez.read(esearch_handle)
esearch_handle.close()

# Second, fetch these entries
efetch_handle = Entrez.efetch(db='nucleotide', id=entries['IdList'], rettype='gb', retmode='xml') 
records = Entrez.parse(efetch_handle)

# Now, we go through the records and look for a feature with name 'EC_number'
for record in records:
      for feature in record['GBSeq_feature-table']:
          for subfeature in feature['GBFeature_quals']:
              if (subfeature['GBQualifier_name'] == 'EC_number'   and
                subfeature['GBQualifier_value'] == ec_num):

                    # If we found it, we extract the seq's start and end
                    accession = record['GBSeq_primary-accession']
                    interval = feature['GBFeature_intervals'][0]
                    interval_start = interval['GBInterval_from']
                    interval_end = interval['GBInterval_to']
                    location = feature['GBFeature_location']
                    if location.startswith('complement'):
                        strand = 2
                    else:
                        strand = 1

                    # Now we fetch the nucleotide sequence
                    handle = Entrez.efetch(db="nucleotide", id=accession,
                                           rettype="fasta", strand=strand,
                                           seq_start = interval_start,
                                           seq_stop = interval_end)
                    seq = SeqIO.read(handle, "fasta")

                    print('>GenBank Accession:{}'.format(accession))
                    print(seq.seq)
efetch_handle.close()

```
Now let's apply that script to our list of EC #'s by using the command line:
```{bash, eval=FALSE}
while read line;     
  do python scripts/nucl_from_ec.py $line > "$line".txt;    
  done < ec_numbers.txt
```

```{bash}
cd Documents/COBS_CAZY
head 3.2.1.37.txt 
```





This command line script will go through our list of EC numbers and retrieve the associated nucleotide sequences and output a file for each EC # with those sequences in fasta format. 
____
Python script wasn't working. So we are using downloaded protien sequences to blast against NCBI 
Let's try an online server 
mkdir temp_cazy
_____
To use hpcc:
Lot's of disc space, temporary unlimited (3 months)

ssh *****@hpcc.msu.edu
password:

We are at the gateway, now connect to development node (dev-intell-16, it's newer)

ssh dv-intell-16
~ is home directory, only 50 g and Jin can't see it.
cd /mnt/scratch/*****
empty big directory, temp unlimeted good for downloading big files for a short time
now lets download db to that  ftp://ftp.ncbi.nlm.nih.gov
refseq_protien.10.tar.gz.md5 and *.gz is what we want to put on HPCC

In HPC: move out of dev intel...everyone is using it. There are many more computers to choose from begind that...called nodes
We want to work in a node, we need to submit a job by using qsub
```{bash, eval=FALSE}

```

[*****@dev-intel16 ~]$ cd /mnt/scratch/*****
[*****@dev-intel16 *****]$ ls
[*****@dev-intel16 *****]$ mkdir COBS_CAZY
[*****@dev-intel16 *****]$ cd COBS_CAZY/
[*****@dev-intel16 COBS_CAZY]$ emacs download.refseq.qsub

[1]+  Stopped                 emacs download.refseq.qsub
[*****@dev-intel16 COBS_CAZY]$ ls
download.refseq.qsub  download.refseq.qsub~
[*****@dev-intel16 COBS_CAZY]$ rm *~
[*****@dev-intel16 COBS_CAZY]$ ls
download.refseq.qsub
[*****@dev-intel16 COBS_CAZY]$ qsub download.refseq.qsub
37523906.mgr-04.i
[*****@dev-intel16 COBS_CAZY]$ ls
download.refseq.qsub
[*****@dev-intel16 COBS_CAZY]$

.o is output, .e is error

S = Q in line
S = R running
S = C complete

Check if everything downloaded correctly by using md5(it tells us if downloaded correctly)
Once all is downlaoded, unzip



Then ready to run blastp (files in fasta) clone the github repository
# note change spellin on protein for emacs

module load git (to upload repository)
git clone url
once repository is uploaded then blastp

submit blastp: qsub blastp.qsub

#2016_12_02:09:06 AM, Noticed that we did not qsub blastp for EC *.86, let's make that script and submit the job. 
```{bash, eval=FALSE}
ssh *****@hpcc.msu.edu
password:
ssh dev-intel16
cd /mnt/scratch/*****/COBS_CAZY
ls

```

